ABSTRACT



An economic, scalable machine learning system and process perform document (concept) classification with high accuracy using large topic schemes, including large hierarchical topic schemes. One or more highly relevant classification topics is suggested for a given document (concept) to be classified. The invention includes training and concept classification processes. The invention also provides methods that may be used as part of the training and/or concept classification processes, including: a method of scoring the relevance of features in training concepts, a method of ranking concepts based on relevance score, and a method of voting on topics associated with an input concept. In a preferred embodiment, the invention is applied to the legal (case law) domain, classifying legal concepts (rules of law) according to a proprietary legal topic classification scheme (a hierarchical scheme of areas of law).

20 Claims, 8 Drawing Sheets 
SYSTEM AND METHOD FOR CLASSIFYING LEGAL CONCEPTS USING LEGAL TOPIC SCHEME

This application claims the benefit of Provisional Application No. 60/147,389, filed Aug. 6, 1999.

COPYRIGHT NOTICE

A portion of this disclosure, including Appendices, is subject to copyright protection. Limited permission is granted to facsimile reproduction of the patent document or patent disclosure as it appears in the U.S. Patent and Trademark Office (PTO) patent file or records, but the copyright owner reserves all other copyright rights whatsoever.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to systems and methods for automated classification. More specifically, the invention relates to automated systems and methods for classifying concepts (such as legal concepts, including points of law from court opinions) according to a topic scheme (such as a hierarchical legal topic classification scheme).

2. Related Art

Document classification has long been recognized as one of the most important tasks in text processing. Classification of documents provides for quality document retrieval, and enables browsing and linking among documents across a collection. The benefits of such easy access are especially apparent in slowly-evolving subject domains such as law. The generally stable vocabularies and topics of the legal domain insure long-term return on any classification work.

There are two broad document classification approaches: unsupervised learning and supervised learning. The approaches are differentiated by whether a pre-defined classification scheme is used.

Unsupervised learning is a data-driven classification approach, based on the assumption that documents can be well organized by a natural structure inherent to the data. Those familiar with the data should be able to follow this natural structure to locate their information. A large body of information retrieval literature has focused on this approach, mostly related to document clustering [Borko 1963, Sparck Jones 1970, van Rijsbergen 1979, Griffiths 1984, Willett 1988, Salton 1990]. More recently some machine learning techniques have been applied to this classification task [Farkas 1993]; the term "unsupervised learning" was coined to describe this approach. The following patents are associated with this approach: U.S. Pat. No. 5,182,708 and U.S. Pat. No. 5,832,470.

Opposite to the unsupervised learning approach to document classification is supervised learning. With this approach, a pre-defined "topic scheme" is given, along with the classified documents for each topic in the scheme. The topic scheme may be a simple list of discrete topics, or a complex hierarchical topic scheme. Supervised learning technology focuses on the task of feeding a computer meaningful topical descriptions so that it can learn to classify a document of unknown type.

When a topic scheme includes a simple list of discrete topics (one without complex hierarchical relationships among the topics), the document classification becomes mere document categorization. Many machine learning techniques, including the retrieval technique of relevance feedback, have been tried for this task [Buckley 1994, Lewis 1994, and Mitchell 1997]. In addition to the effectiveness of learning methods themselves, the success of automatic categorization depends on the number of topics in the scheme, on the amount of quality training documents, and on the degree that the topics are mutually exclusive to one another. An example is disclosed in U.S. Pat. No. 5,675,710.

The most difficult document classification centers on classifying documents using a hierarchical topic scheme. In this task, one has to consider horizontal relationships among sister topics, which tend to be close to each other and are confusing to a computer. Moreover, one must also be concerned with vertical inheritance relationships. Many machine learning techniques have trouble accommodating these two semantic relationships simultaneously in their learning or training, and thereafter have difficulty in classifying documents effectively. The task becomes more challenging if the topic scheme is very large, if the training documents are not topically exclusive, if the size of documents is small, and if the documents lack descriptive information.

To face these challenges, some techniques (U.S. Pat. No. [illegible]) [illegible]. Still others (U.S. Pat. Nos. 5,371,807 and 5,768,580) use linguistic knowledge to combat the ambiguity introduced in the hierarchical scheme.

However, these techniques can only handle small, domain-specific classification work. They have difficulty in [illegible] processing, either because of their simplicity in [illegible] recognition or because of the daunting demand of [illegible] expensive lexicons to support the linguistic parsing.

Thus, there is a need in the art to develop an economic, scalable machine learning process that can perform document classification with high accuracy using a large, hierarchical topic scheme. [illegible] invention is directed.

Non-Patent References mentioned above:

Borko, H. and Bernick, M. 1963. "Automatic document classification." Journal of the Association for Computing Machinery, pp. 151-161.
Sparck Jones, K. 1970. "Some thoughts on classification for retrieval." Journal of Documentation, pp. 89-102.
van Rijsbergen, C. J. 1979. Information Retrieval, 2nd edition, Butterworths, London.
Griffiths, A. and others. 1984. "Hierarchic agglomerative clustering methods for automatic document classification." Journal of Documentation, pp. 175-205.
Willett, P. 1988. "Recent trends in hierarchic document clustering: A critical review." Information Processing and Management, pp. 577-598.
Salton, G. and Buckley, C. 1990. "Flexible text matching for information retrieval." Technical Report 90-1158, Cornell University, Ithaca, N.Y.
Farkas, J. 1993. "Neural networks and document classification." Canadian Conference on Electrical and Computer Engineering, pp. 1-4.
Buckley, C. and others. 1994. "Automatic routing and ad-hoc retrieval using SMART: TREC 2." The 2nd Text Retrieval Conference, edited by Donna Harman, NIST Special Publication 500-215, pp. 45-55.
Lewis, D. D. and Gale, W. A. 1994. "A sequential algorithm for training text classifiers." Proceedings of the 17th Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval, pp. 3-12, London.
Mitchell, T. 1997. Machine Learning, McGraw Hill, New York.

SUMMARY OF THE INVENTION

The inventive system and method provide an economic, scalable machine learning process that performs document classification with high accuracy using large topic schemes, including large hierarchical topic schemes. More specifically, the inventive system and method suggest one or more highly relevant classification topics for a given document to be classified.

The invention provides several features, including novel training and concept classification processes. The invention also provides novel methods that may be used as part of the training and/or concept classification processes, including: a method of scoring the relevance of features in training concepts, a method of ranking concepts based on relevance score, and a method of voting on topics associated with an input concept.

In a preferred embodiment, the invention is applied to the legal (case law) domain, classifying legal concepts (such as rules of law) according to a proprietary legal topic classification scheme (a hierarchy of areas of law).

Other objects, features and advantages of the present invention will be apparent to those skilled in the art upon a reading of this specification including the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

The invention is better understood by reading the following Detailed Description of the Preferred Embodiments with reference to the accompanying drawing figures, in which like reference numerals refer to like elements throughout, and in which:

FIG. 1 illustrates an exemplary hardware configuration in which the inventive classification system and method may be implemented.

FIG. 2 is a high-level flow chart schematically indicating a training process 200 and a classification process 210, alongside a knowledge base 201, a topic scheme 202 and exemplary lists 203, 204 that are used in the processes.

FIG. 3 is a flow chart indicating an exemplary training process 200 (FIG. 2).

FIG. 4 is a flow chart indicating details of an exemplary legal concepts extraction step (301 from FIG. 3; 701 from FIG. 7).

FIG. 5 is a flow chart indicating details of an exemplary feature extraction step (302 from FIG. 3; 702 from FIG. 7).

FIG. 6 is a flow chart representing details of an exemplary knowledge building step 303 (FIG. 3).

FIG. 7 is a flow chart indicating an exemplary embodiment of classification process 210 (FIG. 2).

FIG. 8 is a flow chart representing details of an exemplary concept classification step 703 (FIG. 7).

FIG. 9 is a flow chart illustrating details of an exemplary process 801 (FIG. 8) for ranking training concepts by relevance score.

FIG. 10 is a flow chart illustrating details of an exemplary process 802 (FIG. 8) for voting to generate a list of relevant topics.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

In describing preferred embodiments of the present invention illustrated in the drawings, specific terminology is employed for the sake of clarity. However, the invention is not intended to be limited to the specific terminology so selected, and it is to be understood that each specific element includes all technical equivalents that operate in a similar manner to accomplish a similar purpose.

Background Terminology.

As background to understanding the inventive system and method, it is understood that preferred embodiments [illegible] legal concepts in a case law document (court opinion), using machine-based learning techniques. Then, the [illegible] predefined [illegible]. In particular, a "concept" in the legal domain, called a "legal concept," may be more widely known as a "rule of law." A topic in the legal domain, called a "legal topic," may [illegible] invention can be applied to domains other than the legal domain, and the broad terms "concept" and "topic" should not be limited to the legal domain.

Referring to the particular embodiment that is applied to the legal domain, case law documents must each have a set of distinct legal concepts. In a particular preferred embodiment, "legal concepts" may be defined as "controlling points of law, material to the disposition of the case, stated in the language of the court." Typically, a judicial opinion passage contains a legal concept if:

1. The passage is a positive statement of a rule of law, such as:
   [illegible] general rule, etc.)
   a definition of a legal term of art
   a statement of the applicable standard of review
   an express statement that another case is overruled or disapproved
   an interpretation or construction (not just a quotation) of a statute or court rule
2. The rule of law stated is significant to the court's resolution of the case,
3. The court expressly or implicitly adopts the stated rule of law as its own.

A sample set of legal concepts is shown in Appendix A. A sample scheme of legal topics is shown in Appendix B. While this sample scheme is hierarchical in nature (with general top-level topics and more specific lower-level topics), the invention is not limited to this type of scheme.

With this terminology understood, the invention is first described in general terms, followed by a more detailed explanation.

Overview.

A preferred embodiment of the present invention provides a legal concept classification system and method that analyze text of a concept from a legal document and provide relevant topics for the legal concept from a given legal topic scheme. The invention uses a database of legal concepts previously classified according to the legal topic scheme, a list of legal phrases, and a list of stopwords.

The preferred embodiment involves two major processes to provide topics for legal concepts: training, and classifying. The system is first trained to distinguish topic trends in legal concepts according to the given topic scheme. Once trained, the system then classifies other legal concepts according to this same topic scheme.

Training Process - Overview.

The process of training involves:
gathering of training data,
extraction of previously-classified legal concepts,
analysis of the "features" in the training data,
calculation of relevance scores for these features, and
storage of this information in a knowledge base.

This training must be done to initially create information in the knowledge base, but it is envisioned that training may occur throughout the life of the application with feedback of newly classified legal concepts, to continually improve the quality of the knowledge base.

A plurality of case law documents is parsed, extracting classified legal concepts from the appropriate section of the document. This requires a considerable sample of case law documents from which legal concepts have been extracted and classified according to the same topic scheme.

Once extracted from the case, legal concepts are analyzed to determine the distinguishing "features" of these concepts. In a particular embodiment, the features for the training and classification processes have been identified:
Terms
Legal phrases
Case cites

The creation of these features from the classified legal concepts is best understood with the following hypothetical example: a legal topic, followed by a legal concept as might be found in a case.

Criminal Law & Procedure-Evidence-Opinion Testimony
It is the role of the factfinder, not the appellate court, to judge the credibility of witnesses, or lack thereof, and to decide whether to accept such testimony or to disregard it entirely; the appellate court must view the evidence in a light most favorable to the jury's verdict, as noted in Smith v. Ohio [illegible].

This legal concept's topic (Criminal Law & Procedure-Evidence-Opinion Testimony) is from a hierarchical (multi-tiered) legal topic scheme. More specifically:
the top-tier topic, the most general, is "Criminal Law & Procedure";
the 2nd-tier topic, more specific, is "Evidence"; and
the 3rd-tier topic, the most specific, is "Opinion Testimony".

From the text of this legal concept, the term features are extracted, excluding meaningless "stop-words", and depluralizing to normalize terms. For example, the terms likely to be significant features from the above sample legal concept might be "credibility", "witness", or "disregard". The citation Smith v. Ohio would be extracted as a case cite feature. A legal phrase in this sample legal concept would be "appellate courts". If available, opinion text identified to be relevant to each legal concept is also scanned to find relevant case citations. Of course, the exact usage of these features relative to the learning processes to be described should not limit the scope of the invention.

Once extracted and analyzed, relevance scores are calculated for each feature in each concept. This step of the training process uses a learning step based on a relevance-ranked text retrieval process. This step defines the relevance of features using frequency of the features, both within each legal concept and across the entire set of training legal concepts. These two frequencies (within a legal concept and across the entire set) are combined to give a relevance score for each feature, for each legal concept. These legal concept relevance scores are then used to identify the most relevant topics for the candidate concept, during the classification process.

The features and their scores, along with links to their related legal concepts, are then stored in the knowledge base for use during the subsequent classification process.

Classification Process - Overview.

Once trained, the invention is used to classify previously-unclassified concepts such as legal concepts.

According to a preferred embodiment, this classification process involves a classification step that includes:
analyzing a "candidate" (or "target" or "input") legal concept for features,
searching for similar legal concepts in the knowledge base,
ranking these similar concepts based on similarity to the candidate concepts, and
voting to identify the most relevant topics from these similar legal concepts.

The candidate legal concept is analyzed for features. This step is identical to the analysis of features in classified legal concepts done during the training process. The same set of distinguishing features (terms, legal phrases, and case citations) used during training should be used in this step to be compatible with the knowledge base.

Next, the knowledge base is searched for training legal concepts similar in features to the target legal concept. These [illegible] according to strength of match.

From these matching training legal concepts, the most relevant topics for the candidate legal concept are identified. Each legal concept in the knowledge base has at least one topic associated with it, so a list of topics is generated from the matching legal concepts. This topic list is sorted by relevance, and the most relevant topics chosen for the candidate legal concept.

The training process and the classification process (including the relevance-ranked classification step) having been described briefly above, the embodiments of the invention are now described in greater detail.

The inventive method is more easily understood with references to examples. The examples used herein primarily deal with classifying individual legal concepts of a case law document according to a particular hierarchical topic scheme, but this does not limit the scope of the present invention. In fact, any size text unit such as a sentence, passage, paragraph or an entire section of a document can be classified according to an arbitrary topic scheme.

Exemplary Hardware Embodiment.

Embodiments of the inventive training and classification system may be implemented as a software system including a series of modules on a conventional computer.

As shown in FIG. 1, an exemplary hardware platform includes a central processing unit 100. The central processing unit 100 interacts with a human user through a user interface 101. The user interface is used for inputting information into the system and for interaction between the system and the human user. The user interface includes, for example, a video display, keyboard and mouse. Memory 102 provides storage for data (such as the knowledge base, stop word list and legal phrase list) and software programs (such as the training and classification processes) that are executed by the central processing unit. Auxiliary memory 103, such as a hard disk drive or a tape drive, provides additional storage capacity and a means for retrieving large batches of information.

All components shown in FIG. 1 may be of a type well known in the art. For example, the system may include a SUN workstation including the execution platform SPARCsystem 10 and SUN OS Version 5.5.1, available from SUN MICROSYSTEMS of Sunnyvale, Calif. Of course, the system of the present invention may be implemented on any number of computer systems. Moreover, although the preferred embodiment uses the PERL language (for text parsing tasks) and C++ (for number crunching and database access tasks), any appropriate programming language may be used to implement the present invention.

The Training and Classification Processes.

Referring to FIG. 2, the preferred embodiment includes a two step process as shown in which the system is first trained (block 200) before it classifies legal concepts [illegible].

A knowledge base 201 is utilized to store training results during the training process. The training results stored in the knowledge base during the training process are used in the subsequent classification process.

Both the training process and the classification process make use of a predetermined topic scheme 202 (see example of a legal topic scheme in Appendix B). In a particular preferred embodiment, the training process and the classification process also make use of a Stop Word List 203 (see example in Appendix C) and a Phrase List 204 (see example of a Legal Phrase list in Appendix D).

The Training Process.

The machine learning system is first trained, as shown in FIG. 3 and related figures, before it classifies legal concepts. This training requires case law documents with legal concepts identified, and each concept classified according to the target topic scheme.

The training process links the set of each legal concept's extracted features with that legal concept's associated topic(s). The following discussion describes an embodiment of the training process, based on a relevance-ranked approach.

The embodiment of the concept ranking step used in the training process uses frequency of occurrence of terms and features, both within each individual legal concept and across the entire set of legal concepts. Generally, a term or feature that occurs frequently within a legal concept is a strong indicator of the topic, unless that term or feature also frequently occurs in many other legal concepts. Intuitively, a highly common legal term, like "court" or "trial", does not contribute to assigning a specific legal topic to a legal concept.

First, a plurality of previously classified training documents are input, as shown at block 300. Then, that plurality of case law documents is parsed to extract the legal concepts, as shown at block 301. "Features" (terms, legal phrases and embedded citations) of each legal concept are then extracted (block 302) and attached to the text in a manner suitable to the learning method being used. Relevance scores are then generated for these features (block 303). Finally, the results are stored (block 304) in knowledge base 201.

Referring more specifically to FIG. 3, these training steps are now discussed in more detail. Many of the steps in the training process are also used in the classification process that is discussed with reference to FIG. 7.

The step of extracting legal concepts is used as step 301 during the training process shown in FIG. 3 and as step 701 during the classification process shown in FIG. 7. The details of steps 301 and 701 are illustrated in FIG. 4.

Referring to FIG. 4, block 400 illustrates the partitioning of text accomplished when a case law document is parsed and partitioned by section, to identify the section containing legal concepts.

Then, as shown in block 401, the legal concept section is parsed and partitioned into individual legal concepts. Each legal concept is stored along with the topic(s) associated with that legal concept. If a legal concept has no topics from the target topic scheme, then that legal concept is discarded.

The feature extraction step, used as step 302 during training and step 702 during classification, is detailed in FIG. 5. This step involves extracting the features necessary for accurately classifying legal concepts, and is one of the primary objects of the invention. The format of the feature when associated with the legal concept depends on the format required by the learning method. Examples are given for the relevance-ranked classification step, described in more detail with reference to FIG. 8.

Referring more specifically to FIG. 5, each term in the [illegible] stop-words from the stop-word list (see example in Appendix C) are removed from [illegible].

In step 501, legal phrases such as "criminal history", "custody dispute", "eminent domain" etc. are extracted as a feature. The legal concept is searched for legal phrases (see sample in Appendix D).

Finally, step 502 involves extracting cites to other case law documents, such as "People v. Medina (1995) 39 Cal. App. 4th 643, 650".

After features have been extracted, then relevance scores for each feature are generated. The generation of relevance scores includes:
the conversion of features into terms,
the generation of term frequencies within legal concepts across the set of [illegible],
the generation of (so-called) document frequencies (more aptly, concept frequencies),
the generation of inverse "document" frequencies (more aptly, inverse concept frequencies), and
the generation of relevance scores for each term.

Details of these steps are described as follows, with reference to FIG. 6.

A straightforward approach to using abstract features in a relevance-ranked approach is to simply convert each feature in a legal concept into a "term" (shown as step 600). In a particular preferred embodiment, a "term" is a mnemonic that uniquely distinguishes that feature from all other features.

For example, the legal phrase "administrative authority" can be easily converted to a "term" like "administrative_authority" (with an underline character between words). Or a case citation like "39 Cal. App. 4th 643" can be converted into a term like "39_CalApp4_643". In this way, each feature is converted into a term that is well-defined and unique across the set.

Then, as shown in block 601, for each "term" (including all features as well as words), the "term frequency" (TF) is calculated for each legal concept in which that term appears, using the number of occurrences of that term within the legal concept. The "average term frequency" (AVE_TF) of all terms in the legal concept is also calculated.

For each term, the total number of legal concepts in which a term occurs in the training set is determined, as shown in block 602. This number is determined in the same way that the conventional "document" frequency (DF) is calculated. In the text searching art, the term "document" has many meanings and can thus be ambiguous. In this specification "document" has already been used to refer to the entire legal opinion and not to a concept that is a part of the opinion. Therefore, to avoid ambiguity, the expanded term "document frequency" will no longer be used in this specification. Instead, DF will continue to be used, with the understanding that in the context of this specification DF actually refers to a concept frequency.

Block 603 represents the calculation of how widely that term is used across the entire body of training legal concepts.
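As an illustration only, the term conversion of step 600 and the frequency counts of blocks 601 and 602 might be sketched in Python as follows. The function names (phrase_to_term, term_frequencies, concept_frequencies) are hypothetical and do not appear in this specification, and AVE_TF is assumed here to be the mean occurrence count over the distinct terms of one legal concept, which is only one possible reading of "average term frequency".

```python
from collections import Counter


def phrase_to_term(phrase):
    """Step 600 (assumed form): join the words of a phrase with underscores."""
    return "_".join(phrase.split())


def term_frequencies(terms):
    """Block 601: count occurrences of each term within one legal concept.

    Returns (TF per term, AVE_TF). AVE_TF is taken as total occurrences
    divided by the number of distinct terms; this is an assumption.
    """
    tf = Counter(terms)
    ave_tf = sum(tf.values()) / len(tf) if tf else 0.0
    return tf, ave_tf


def concept_frequencies(concepts):
    """Block 602: DF, the number of legal concepts in which each term occurs."""
    df = Counter()
    for terms in concepts:
        df.update(set(terms))  # each concept counts a term at most once
    return df


# Two toy legal concepts, already reduced to lists of terms.
concepts = [
    ["credibility", "witness", "witness", phrase_to_term("appellate court")],
    ["witness", phrase_to_term("eminent domain")],
]
tf, ave_tf = term_frequencies(concepts[0])
df = concept_frequencies(concepts)
```

In this sketch, a term such as "witness" that appears in both sample concepts receives a DF of 2, while "credibility", which appears in only one, receives a DF of 1.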
The calculation of block 603 is made in the same way that the conventional inverse "document" frequency (IDF) is calculated. Because "document" is used in this specification to refer to an entire legal opinion, the expanded term "inverse document frequency" will no longer be used. However, it is understood that IDF actually refers to an inverse concept frequency. In any event, this calculation is made using the DF and the total number of legal concepts in the training set (DBSIZE).

Next, for each term in a legal concept, a relevance score is then calculated, as shown in block 604. This calculation involves using the term frequency (TF) for that term-legal concept pair, the AVE_TF, that term's IDF, the length of the legal concept, and the overall average length of legal concepts in the set. This scoring technique is one of the primary objects of this invention.

Exemplary formulas for calculating the relevance score:

if (docLength > aveDocLength):
   TFwt = (TF + TF/AVE_TF) / (TF + TF/AVE_TF + 2(α + β × (docLength - aveDocLength)/aveDocLength))

if (docLength <= aveDocLength):
   TFwt = (TF + TF/AVE_TF) / (TF + TF/AVE_TF + 2(α + β × (aveDocLength - docLength [illegible]) aveDocLength))

IDF = log([illegible] / (DF + 0.5))

score = TFwt × IDF

where:
TFwt = Term frequency weight
TF = Term frequency within current legal concept
AVE_TF = Average term frequency of terms in the current legal concept
α, β = Scale factors, such that α+β=1; exemplary values are α=0.4 and β=0.6
docLength = Length of current legal concept, in characters
aveDocLength = Average length, in characters, of all legal concepts in training set
IDF = Inverse document (i.e., concept) frequency for term across training set
DBSIZE = Total number of legal concepts in training set
DF = "Document" frequency (number of legal concepts in which a term occurs)
score = Relevance score for term in a legal concept

Finally, the results of various calculations are stored in knowledge base 201, as shown in block 304 (FIG. 3). The concept frequency DF and the relevance score for each term in a legal concept are stored in an "inverted index" in the knowledge base. As will readily be understood by those skilled in the art, an inverted index in this context is a list of each term, each legal concept in which it occurs, and the relevance score for that term-legal concept, such that the list can be easily searched by term.

Also saved in the knowledge base are the basic relationships between each legal concept and its associated topic(s), determined earlier in the general training process. This set of relationships establishes the link between terms in a legal concept and the topics relevant to those terms.

The Classification Process.

The training process 200 having been described in detail above, the inventive classification process 210 is now described. Shown in FIG. 7, the classification process 210 involves classifying legal concepts of an unknown topic according to a given topic scheme.

First, court case documents are input, as indicated at block 700. From each case, legal concepts are extracted, as shown at block 701. The extraction step 701 may be the same as [illegible].

In step 702 (which may be the same as the step used by the training process, and shown in FIG. 5), features are extracted from each legal concept and are associated with that legal concept in a manner consistent with the training process used.

In concept classification step 703, each legal concept and its identified features are first input. Information gathered during training is input from the knowledge base. This information is used to generate a set of scores for the current legal concept, one each for the best-matching legal concepts in the training set. Then, the topics for the legal concept are determined. Block 703 involves using the features found in the current candidate legal concept, and comparing them to the features found in the training legal concepts. The topics associated with the training legal concepts found most similar to the legal concept in question are collated and sorted to determine the most relevant topics.

A preferred embodiment of the concept classification step 703 (FIG. 7) is detailed in FIG. 8. The illustrated classification step uses frequency of features in the candidate legal concept to find similar legal concepts (and therefore similar topics) in the training knowledge base. The classification step may be termed a relevance-ranked classification step, and includes:
analyzing input concept for features,
ranking of training concepts by relevance score for these features, and
voting on the topics associated with these training concepts to determine the best topics.

[illegible] converting "features" to "terms", may be the same as that used by FIG. 6 block 600, described above.

In block 801, all training concepts are ranked by their relevance scores for these candidate terms.

A preferred embodiment of the ranking step 801 (FIG. 8) is detailed in FIG. 9. Referring to FIG. 9, the first two steps, illustrated as blocks 900 and 901, are optional but are useful in optimizing for later steps.

In block 900, the list of candidate terms is sorted in ascending order by DF, retrieved from the knowledge base. This orders the term list from the least common terms to the most common terms.

Next, in block 901, the candidate term list is reduced to a selection of the least common terms in the list. This in turn reduces the processing required in subsequent ranking steps;

Significantly, use of an inverted index greatly increases the number of terms selected depends on the amount of 

the scalability of the invention. This is because an inverted optimization desired. For example, refer to Table II. If 

index provides for very efficient searching on features and optional steps 900 and 901 are not used, subsequent required 

allows for handling much larger bodies of training data. A ranking steps operate on the entire list of candidate terms, 

portion of an exemplary inverted index, given legal concepts 65 In block 902 the relevance scores for training concepts, 

23, 38, and 127, sorted by term, may be represented as in for all terms in the candidate concept, are retrieved from the 

TABLE I. knowledge base. 
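
The exemplary relevance-score formulas given for the training process (score = TFwt × IDF) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names, the natural logarithm, and the `smoothing` parameter are hypothetical choices. The scale factors 0.4 and 0.6 are the exemplary values from the description, and the default IDF denominator follows the description (DF + 0.5), whereas the claims print DF + 0.05.

```python
import math

ALPHA = 0.4  # exemplary scale factor from the description
BETA = 0.6   # exemplary scale factor; ALPHA + BETA = 1


def tf_weight(tf, ave_tf, doc_length, ave_doc_length, alpha=ALPHA, beta=BETA):
    """Term frequency weight (TFwt) for one term in one legal concept."""
    base = tf + tf / ave_tf
    if doc_length > ave_doc_length:
        # Concept is longer than average.
        length_term = (doc_length - ave_doc_length) / ave_doc_length
    else:
        # Concept is at or below average length.
        length_term = (ave_doc_length - doc_length + 1) / ave_doc_length
    return base / (base + 2 * (alpha + beta * length_term))


def idf(df, db_size, smoothing=0.5):
    """Inverse concept frequency. The log base is an assumption (natural log)."""
    return math.log((db_size - df + 0.5) / (df + smoothing))


def relevance_score(tf, ave_tf, doc_length, ave_doc_length, df, db_size):
    """Relevance score of a term in a legal concept: TFwt x IDF."""
    return tf_weight(tf, ave_tf, doc_length, ave_doc_length) * idf(df, db_size)
```

Keeping `smoothing` as a parameter lets the same sketch reproduce either printed variant of the IDF denominator without changing the rest of the calculation.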
In block 903, for every training concept, the relevance
scores for the candidate terms are summed into a total
relevance score for that training concept. Table III shows an
exemplary set of five candidate terms with seven training
legal concepts and their relevance scores. The relevance
scores for each legal concept are totaled in the bottom row.
This would give a list of training legal concepts with total
relevance scores, such as shown in Table IV.

In block 904, the training concepts are sorted descending 
by these total relevance scores, resulting in the most similar 
training concepts being first in this sorted list. For example, 
sorting Table IV by total score yields the results in Table V. 
This sorted list generally indicates that the most relevant 
training legal concept for the current candidate concept is 
Concept 1, the next most relevant concept is Concept 9, and 
so forth down the list. 
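
The summing and sorting of blocks 902 to 904 can be sketched as below, using the Table III scores. This is an illustrative sketch only: the dictionary layout of the inverted index and the function name are assumptions, and zero scores are assumed not to be stored, so Concept 4 (which scores 0 on every term) simply never appears in the result. Repeated terms are kept in the candidate list because the Table III totals count each occurrence.

```python
from collections import defaultdict

# Candidate terms, one entry per occurrence, as listed in Tables II and III.
candidate_terms = (
    ["seaworthy"] * 2 + ["admiralty"] * 4 + ["negligent"] * 2
    + ["injure"] * 2 + ["witness"] * 6
)

# Hypothetical inverted index: term -> {training concept: relevance score}.
inverted_index = {
    "seaworthy": {1: 0.9997, 9: 0.6458},
    "admiralty": {1: 0.8023, 9: 0.9322, 10: 0.7083, 15: 0.4016},
    "negligent": {1: 0.7922, 9: 0.5738, 15: 0.7992},
    "injure": {10: 0.9955},
    "witness": {1: 0.1047, 2: 0.0882, 9: 0.0023, 13: 0.4771, 15: 0.1025},
}


def rank_training_concepts(terms, index):
    """Sum per-term scores for each training concept, then sort descending."""
    totals = defaultdict(float)
    for term in terms:                                         # block 902
        for concept_id, score in index.get(term, {}).items():
            totals[concept_id] += score                        # block 903
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)  # block 904


ranked = rank_training_concepts(candidate_terms, inverted_index)
```

Because only the posting lists of the candidate terms are visited, the cost depends on how many training concepts share those terms rather than on the size of the whole training set.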

This completes discussion of the details of FIG. 8 block 
801 as it may be implemented in FIG. 9. 
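
The optional pruning of blocks 900 and 901 can be sketched in the same spirit, using the DF values of Table II. The function name and the cutoff expressed as a number of distinct terms are assumptions made for illustration; the description leaves the number of retained terms to the amount of optimization desired.

```python
# Document frequencies (DF) for the candidate terms, as listed in Table II.
document_frequency = {
    "seaworthy": 1584,
    "admiralty": 8092,
    "negligent": 12744,
    "injure": 28939,
    "witness": 42677,
}

candidate_terms = (
    ["witness"] * 6 + ["injure"] * 2 + ["admiralty"] * 4
    + ["seaworthy"] * 2 + ["negligent"] * 2
)


def select_least_common_terms(terms, df, max_distinct_terms):
    """Keep only occurrences of the rarest terms, ordered by ascending DF."""
    # Block 900: order the distinct terms from least to most common.
    distinct = sorted(set(terms), key=lambda t: (df[t], t))
    # Block 901: retain a selection of the least common terms.
    kept = set(distinct[:max_distinct_terms])
    return sorted((t for t in terms if t in kept), key=lambda t: df[t])


pruned = select_least_common_terms(candidate_terms, document_frequency, 2)
```

Dropping very common terms such as "witness" shortens the posting lists that the later ranking steps have to read, at the cost of ignoring weak evidence.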

Referring again to FIG. 8, in block 802, essentially a 
voting step, the topics associated with the training legal 
concepts that are most like the candidate legal concept are 
collated and sorted to generate a final list of relevant topics. 
In a final list of matching legal concepts, a single topic may 
be found more than once and therefore is more likely 
relevant to the candidate concept than topics found only 
once, for example. 

Significantly, the voting process of block 802 distin- 
guishes the more relevant topics from the less relevant 
topics. This voting technique is one of the primary objects of 
the invention. 

A preferred embodiment of the voting step 802 (FIG. 8)
is detailed in FIG. 10. In block 1000, the topics associated 
with the sorted training concepts are retrieved from the 
knowledge base. Then, in block 1001, these training con- 
cepts and their scores are grouped by their topics. In block 
1002, the total relevance score for each topic-group is 
calculated to determine a topic relevance score for that topic. 
Finally, in block 1003, the topics are then sorted descending 
by these topic relevance scores. The resulting list shows the 
most relevant topics first in the list. 

Given a flat (non-hierarchical) topic scheme, a sample list
of matching concepts before voting, with their topics and 
relevance scores sorted by score, might look like TABLE 
VI. 

Thus, before voting, the most relevant topics from this list 
would be, in order of appearance: 
Admiralty Law 
Transportation Law 
Torts 

Bankruptcy Law 

After voting, in which the scores of all the topics are accumulated
and the results re-sorted, the list would look like TABLE VII. So the
list of relevant topics after voting would be sorted differently
than that before voting:

Transportation Law (6.53) 

Admiralty Law (4.55) 

Torts (2.81) 

Bankruptcy Law (0.68)

The "Transportation Law" topic becomes the most rel- 
evant topic after voting because its accumulated relevance 
score is higher than the accumulated scores of the other 
topics. 
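
The flat voting of blocks 1000 to 1003 can be sketched as follows, using the (concept, topic, score) rows as grouped in TABLE VII. The tuple layout and the function name are assumptions for illustration, not the patent's data format.

```python
from collections import defaultdict

# (training concept, topic, relevance score), as grouped in TABLE VII.
matches = [
    (1, "Admiralty Law", 1.53), (2, "Admiralty Law", 1.27),
    (3, "Transportation Law", 0.89), (4, "Torts", 0.88),
    (4, "Transportation Law", 0.88), (5, "Transportation Law", 0.82),
    (6, "Torts", 0.79), (7, "Transportation Law", 0.74),
    (8, "Transportation Law", 0.74), (9, "Admiralty Law", 0.69),
    (10, "Bankruptcy Law", 0.68), (10, "Transportation Law", 0.68),
    (11, "Transportation Law", 0.65), (12, "Torts", 0.59),
    (13, "Transportation Law", 0.58), (14, "Transportation Law", 0.55),
    (14, "Admiralty Law", 0.55), (14, "Torts", 0.55),
    (15, "Admiralty Law", 0.51),
]


def vote_on_topics(rows):
    """Group scores by topic, total them, and sort topics by total."""
    totals = defaultdict(float)
    for _concept, topic, score in rows:   # blocks 1000 and 1001
        totals[topic] += score            # block 1002
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)  # block 1003


topic_list = vote_on_topics(matches)
```

A topic that never holds the single best match can still win the vote when many moderately similar training concepts agree on it, which is what happens to "Transportation Law" in this example.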

If the legal topic scheme is a hierarchical (multi-tier) topic
scheme, a second hierarchical voting may be performed on
the final relevant topics. The final topic list can be grouped
by 1st-tier topic, by 2nd-tier topic, and so on, and then
weighted according to occurrence at each tier. These weights
are then considered in the final list of topics. This technique
takes into consideration similar topics and can help to
improve overall quality of the topics.

For example, the list of matching concepts, before hierarchical
voting, with their topics and relevance scores,
sorted by score, might look like TABLE VIII.

If the relevance scores for these topics are accumulated,
first by Tier 1 to give CumT1, then by Tier 2 to give CumT2,
then by Tier 3 to give CumT3 (assume only a 3-tier
hierarchy for this example), the topic list looks like TABLE
IX.

This list is then sorted highest to lowest score, first by Tier 
1, then by Tier 2, and then by Tier 3, to give TABLE X. 
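
The tier-by-tier accumulation and sorting can be sketched as below, using the topics and scores as they appear in TABLE IX. This is a sketch under assumptions: topics are represented as tuples of tier names, the cumulative score of each tier prefix stands in for the "weight" at that tier, and the helper names are hypothetical.

```python
from collections import defaultdict

ADM, BKR, TORT, TRN = "Admiralty Law", "Bankruptcy Law", "Torts", "Transportation Law"
WATER = (TRN, "Water Transportation & Shipping", "Vessel Safety")

# (topic path, relevance score) rows, following TABLE IX.
rows = [
    ((ADM, "Cargo Care & Custody", "Liability"), 0.69),
    ((ADM, "Cargo Care & Custody", "Liability Exemptions"), 2.03),
    ((ADM, "Cargo Care & Custody", "Liability Exemptions"), 0.55),
    ((ADM, "Non-Cargo Liability", "Death Actions"), 1.33),
    ((ADM, "Non-Cargo Liability", "Jones Act"), 0.51),
    ((BKR, "Property Use, Sale, or Lease"), 0.68),
    ((TORT, "Products Liability", "Negligence"), 0.79),
    ((TORT, "Products Liability", "Negligence"), 0.59),
    ((TORT, "Strict Liability", "Abnormally Dangerous Activities"), 0.88),
    ((TORT, "Vicarious Liability", "Negligent Hiring & Supervision"), 0.55),
    ((TRN, "Carrier Liabilities & Duties", "Ratemaking"), 0.82),
    ((TRN, "Carrier Liabilities & Duties", "Ratemaking"), 0.55),
    ((TRN, "Foreign Commerce Regulation"), 0.68),
    ((TRN, "Intrastate Commerce Regulation"), 0.74),
    ((TRN, "Vehicle Transportation & Shipping", "Traffic Regulation"), 0.89),
    (WATER, 0.88), (WATER, 0.74), (WATER, 0.65), (WATER, 0.58),
]


def cumulative_scores(topic_rows):
    """Accumulate scores for every tier prefix (CumT1, CumT2, CumT3)."""
    cum = defaultdict(float)
    for path, score in topic_rows:
        for depth in range(1, len(path) + 1):
            cum[path[:depth]] += score
    return cum


def sort_by_tier(topic_rows):
    """Sort distinct topics highest first by Tier 1, then Tier 2, then Tier 3."""
    cum = cumulative_scores(topic_rows)

    def key(path):
        tier_scores = [-cum[path[:d]] for d in range(1, len(path) + 1)]
        return (tier_scores, path)  # path breaks exact ties deterministically

    return sorted({path for path, _ in topic_rows}, key=key)


ordered_topics = sort_by_tier(rows)
```

Sorting on the whole sequence of tier totals keeps sibling topics together under their parent, so the result can be printed directly as an indented hierarchy.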

This would be the final list of most relevant topics, sorted 
by relevance. This final list of topics might be represented 
hierarchically as follows: 
Transportation Law
    Water Transportation & Shipping
        Vessel Safety
    Carrier Liabilities & Duties
        Ratemaking
    Vehicle Transportation & Shipping
        Traffic Regulation
    Intrastate Commerce Regulation
    Foreign Commerce Regulation
Admiralty Law
    Cargo Care & Custody
        Liability Exemptions
        Liability
    Non-Cargo Liability
        Death Actions
        Jones Act
Torts
    Products Liability
        Negligence
    Strict Liability
        Abnormally Dangerous Activities
    Vicarious Liability
        Negligent Hiring & Supervision
Bankruptcy Law
    Property Use, Sale, or Lease

Still other (optional) techniques can be used to further 
eliminate irrelevant topics. For example, if a topic's rel- 
evance score is below a predefined threshold, or if the 
number of times the topic occurs among the most relevant 
legal concepts is below a threshold, then that topic could be 
eliminated. 
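
The optional elimination step can be sketched as a simple filter. The threshold values used here are hypothetical, since the description gives none; the topic totals and occurrence counts are those of the flat voting example in TABLE VII.

```python
# Topic totals and occurrence counts from the flat voting example (TABLE VII).
topic_scores = [
    ("Transportation Law", 6.53),
    ("Admiralty Law", 4.55),
    ("Torts", 2.81),
    ("Bankruptcy Law", 0.68),
]
topic_counts = {
    "Transportation Law": 9,
    "Admiralty Law": 5,
    "Torts": 4,
    "Bankruptcy Law": 1,
}


def filter_topics(scores, counts, min_score=0.0, min_count=0):
    """Drop topics whose total score or occurrence count is below a threshold."""
    return [
        (topic, score)
        for topic, score in scores
        if score >= min_score and counts.get(topic, 0) >= min_count
    ]


# Hypothetical thresholds chosen only for illustration.
by_score = filter_topics(topic_scores, topic_counts, min_score=1.0)
by_count = filter_topics(topic_scores, topic_counts, min_count=5)
```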

The inventive methods having been described above, the 
invention also encompasses apparatus (especially program- 
mable computers) for carrying out classification of legal 
concepts. Further, the invention encompasses articles of 
manufacture, specifically, computer readable memory on 
which computer-readable code embodying the methods may 
be stored, so that, when the code is used in conjunction with 
a computer, the computer can carry out the training and
classification processes. 

A non-limiting, illustrative example of an apparatus that the
invention envisions for executing the foregoing methods is
described above and illustrated in FIG. 1: a computer or
other programmable apparatus whose actions are directed by
a computer program or other software.

Non-limiting, illustrative articles of manufacture (storage
media with executable code) may include the memory 103
(FIG. 1), other magnetic disks, optical disks, "floppy"
diskettes, ZIP disks, or other magnetic diskettes, magnetic
tapes, and the like. Each constitutes a computer readable
memory that can be used to direct a computer to function in
a particular manner when used by the computer.

Those skilled in the art, given the preceding description of 
the inventive method, are readily capable of using knowl- 
edge of hardware, of operating systems and software 
platforms, of programming languages, and of storage media, 
to make and use apparatus for concept classification, as well 
as computer readable memory articles of manufacture 
which, when used in conjunction with a computer can carry 
out concept classification. Thus, the invention's scope 
includes not only the method itself, but apparatus and 
articles of manufacture. 

Modifications and variations of the above-described 
embodiments of the present invention are possible, as appre- 
ciated by those skilled in the art in light of the above 
teachings. For example, the particular hardware on which 
the system and method are implemented, the programming 
languages and data formats involved, the inclusion or exclu- 
sion of optional steps, the nature of the concepts to be 
classified, the particular topic scheme used, and other details 
of implementation, may be varied while remaining within 
the scope of the present invention. It is therefore to be 
understood that, within the scope of the appended claims and 
their equivalents, the invention may be practiced otherwise 
than as specifically described. 



TABLE I

Term          Concept   Score
appellate     23        0.08
              38        0.12
court         23        0.01
court         127       0.02
court         38        0.01
credibility   23        0.23
factfinder    23        1.13
factfinder    127       1.04
judge         38        0.05
role          23        0.57
role          38        0.88
role          127       0.42
witnesses     23        0.11
witnesses     38        0.08
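
An inverted index of the kind shown in TABLE I can be sketched as a mapping from each term to its (concept, score) postings. Only a subset of the TABLE I rows is used here, and the function name and in-memory dictionary are assumptions for illustration; the description does not prescribe a storage format.

```python
from collections import defaultdict


def build_inverted_index(postings):
    """Map each term to the list of (legal concept, relevance score) pairs."""
    index = defaultdict(list)
    for term, concept_id, score in postings:
        index[term].append((concept_id, score))
    return dict(index)


# (term, legal concept, relevance score) rows taken from TABLE I.
postings = [
    ("court", 23, 0.01), ("court", 127, 0.02), ("court", 38, 0.01),
    ("factfinder", 23, 1.13), ("factfinder", 127, 1.04),
    ("role", 23, 0.57), ("role", 38, 0.88), ("role", 127, 0.42),
]

index = build_inverted_index(postings)
court_postings = index["court"]  # every legal concept containing "court"
```

Looking up a term returns only the legal concepts that contain it, so a search never has to scan concepts that share no term with the candidate.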






TABLE III

                          Training Legal Concept
Term        1        2        4        9        10       13       15
seaworthy   0.9997   0        0        0.6458   0        0        0
seaworthy   0.9997   0        0        0.6458   0        0        0
admiralty   0.8023   0        0        0.9322   0.7083   0        0.4016
admiralty   0.8023   0        0        0.9322   0.7083   0        0.4016
admiralty   0.8023   0        0        0.9322   0.7083   0        0.4016
admiralty   0.8023   0        0        0.9322   0.7083   0        0.4016
negligent   0.7922   0        0        0.5738   0        0        0.7992
negligent   0.7922   0        0        0.5738   0        0        0.7992
injure      0        0        0        0        0.9955   0        0
injure      0        0        0        0        0.9955   0        0
witness     0.1047   0.0882   0        0.0023   0        0.4771   0.1025
witness     0.1047   0.0882   0        0.0023   0        0.4771   0.1025
witness     0.1047   0.0882   0        0.0023   0        0.4771   0.1025
witness     0.1047   0.0882   0        0.0023   0        0.4771   0.1025
witness     0.1047   0.0882   0        0.0023   0        0.4771   0.1025
witness     0.1047   0.0882   0        0.0023   0        0.4771   0.1025
TOTALS      7.4212   0.5292   0        6.1818   4.8242   2.8626   3.8198


TABLE IV

Training Legal Concept   Total Score
1                        7.4212
2                        0.5292
4                        0
9                        6.1818
10                       4.8242
13                       2.8626
15                       3.8198



TABLE V

Training Legal Concept   Total Score
1                        7.4212
9                        6.1818
10                       4.8242
15                       3.8198
13                       2.8626
2                        0.5292
4                        0



TABLE II

Term        DF
seaworthy   1584
seaworthy   1584
admiralty   8092
admiralty   8092
admiralty   8092
admiralty   8092
negligent   12744
negligent   12744
injure      28939
injure      28939
witness     42677
witness     42677
witness     42677
witness     42677
witness     42677
witness     42677


TABLE VI

Concept   Topic                Score
1         Admiralty Law        1.53
2         Admiralty Law        1.27
3         Transportation Law   0.89
4         Torts                0.88
4         Transportation Law   0.88
5         Transportation Law   0.82
6         Torts                0.79
7         Transportation Law   0.74
8         Transportation Law   0.74
9         Admiralty Law        0.69
10        Bankruptcy Law       0.68
11        Transportation Law   0.68
12        Transportation Law   0.65
12        Torts                0.59
13        Transportation Law   0.58
14        Transportation Law   0.55
14        Admiralty Law        0.55
14        Torts                0.55
15        Admiralty Law        0.51


TABLE VII

Concept   Topic                Score
1         Admiralty Law        1.53
2         Admiralty Law        1.27
9         Admiralty Law        0.69
14        Admiralty Law        0.55
15        Admiralty Law        0.51
                               4.55
4         Torts                0.88
6         Torts                0.79
12        Torts                0.59
14        Torts                0.55
                               2.81
3         Transportation Law   0.89
4         Transportation Law   0.88
5         Transportation Law   0.82
7         Transportation Law   0.74
8         Transportation Law   0.74
10        Transportation Law   0.68
11        Transportation Law   0.65
13        Transportation Law   0.58
14        Transportation Law   0.55
                               6.53
10        Bankruptcy Law       0.68
                               0.68



TABLE VIII

Concept   Topic                                                                         Score
1         Admiralty Law - Cargo Care & Custody - Liability Exemptions                   2.03
2         Admiralty Law - Non-Cargo Liability - Death Actions                           1.33
3         Transportation Law - Vehicle Transportation & Shipping - Traffic Regulation   0.89
4         Torts - Strict Liability - Abnormally Dangerous Activities                    0.88
4         Transportation Law - Water Transportation & Shipping - Vessel Safety          0.88
5         Transportation Law - Carrier Liabilities & Duties - Ratemaking                0.82
6         Torts - Products Liability - Negligence                                       0.79
7         Transportation Law - Intrastate Commerce Regulation                           0.74
8         Transportation Law - Water Transportation & Shipping - Vessel Safety          0.74
9         Admiralty Law - Cargo Care & Custody - Liability Exemptions                   0.69
10        Bankruptcy Law - Property Use, Sale, or Lease                                 0.68
10        Transportation Law - Foreign Commerce Regulation                              0.68
11        Transportation Law - Water Transportation & Shipping - Vessel Safety          0.65
12        Torts - Products Liability - Negligence                                       0.59
13        Transportation Law - Water Transportation & Shipping - Vessel Safety          0.58
14        Transportation Law - Carrier Liabilities & Duties - Ratemaking                0.55
14        Admiralty Law - Cargo Care & Custody - Liability Exemptions                   0.55
14        Torts - Vicarious Liability - Negligent Hiring & Supervision                  0.55
15        Admiralty Law - Non-Cargo Liability - Jones Act                               0.51



TABLE IX

Concept   Topic                                                                         Score   CumT1   CumT2   CumT3
9         Admiralty Law - Cargo Care & Custody - Liability                              0.69                    0.69
1         Admiralty Law - Cargo Care & Custody - Liability Exemptions                   2.03
14        Admiralty Law - Cargo Care & Custody - Liability Exemptions                   0.55            3.27    2.58
2         Admiralty Law - Non-Cargo Liability - Death Actions                           1.33                    1.33
15        Admiralty Law - Non-Cargo Liability - Jones Act                               0.51    5.11    1.84    0.51
10        Bankruptcy Law - Property Use, Sale, or Lease                                 0.68    0.68    0.68    0.68
6         Torts - Products Liability - Negligence                                       0.79
12        Torts - Products Liability - Negligence                                       0.59            1.38    1.38
4         Torts - Strict Liability - Abnormally Dangerous Activities                    0.88            0.88    0.88
14        Torts - Vicarious Liability - Negligent Hiring & Supervision                  0.55    2.81    0.55    0.55
5         Transportation Law - Carrier Liabilities & Duties - Ratemaking                0.82
14        Transportation Law - Carrier Liabilities & Duties - Ratemaking                0.55            1.37    1.37
10        Transportation Law - Foreign Commerce Regulation                              0.68            0.68    0.68
7         Transportation Law - Intrastate Commerce Regulation                           0.74            0.74    0.74
3         Transportation Law - Vehicle Transportation & Shipping - Traffic Regulation   0.89            0.89    0.89
4         Transportation Law - Water Transportation & Shipping - Vessel Safety          0.88
8         Transportation Law - Water Transportation & Shipping - Vessel Safety          0.74
11        Transportation Law - Water Transportation & Shipping - Vessel Safety          0.65
13        Transportation Law - Water Transportation & Shipping - Vessel Safety          0.58    6.53    2.85    2.85



TABLE X

Topic                                                                         CumT1   CumT2   CumT3
Transportation Law - Water Transportation & Shipping - Vessel Safety          6.53    2.85    2.85
Transportation Law - Carrier Liabilities & Duties - Ratemaking                        1.37    1.37
Transportation Law - Vehicle Transportation & Shipping - Traffic Regulation           0.89    0.89
Transportation Law - Intrastate Commerce Regulation                                   0.74    0.74
Transportation Law - Foreign Commerce Regulation                                      0.67    0.67
Admiralty Law - Cargo Care & Custody - Liability Exemptions                   5.11    3.27    2.58
Admiralty Law - Cargo Care & Custody - Liability                                              0.69
Admiralty Law - Non-Cargo Liability - Death Actions                                   1.33    1.33
Admiralty Law - Non-Cargo Liability - Jones Act                                               0.51
Torts - Products Liability - Negligence                                       2.81    1.38    1.38
Torts - Strict Liability - Abnormally Dangerous Activities                            0.88    0.88
Torts - Vicarious Liability - Negligent Hiring & Supervision                          0.55    0.55
Bankruptcy Law - Property Use, Sale, or Lease                                 0.68    0.68    0.68



APPENDICES 

APPENDIX A - LEGAL CONCEPTS 

APPENDIX B - HIERARCHICAL TOPIC 
SCHEME 

APPENDIX C - STOP-WORD LIST

APPENDIX D - LEGAL PHRASE LIST 



Concerning the content of the following Appendices, see
the copyright notice at the beginning of the specification.

APPENDIX A

Exemplary LEGAL CONCEPTS (Points of Law, etc.)
Concepts Listed beneath respective Legal Topics from Hierarchical
Legal Topic Scheme

1. Civil Procedure-Injunctions-Permanent Injunctions
Civil Procedure-Appeals-Standards of Review-Abuse of Discretion
The grant or denial of an injunction is solely within the trial court's
discretion and, therefore, a reviewing court should not disturb the
judgment of the trial court absent a showing of a clear abuse of
discretion. An abuse of discretion involves more than an error of
judgment. It is an attitude on the part of the court that is
unreasonable, unconscionable, or arbitrary.

2. Civil Procedure-Appeals-Standards of Review-Abuse of Discretion
A reviewing court should presume that the trial court's findings are
accurate, since the trial judge is best able to view the witnesses and
observe their demeanor, gestures, and voice inflections and use these
observations in weighing the credibility of the witnesses.

3. Contract Law-Consideration-Mutual Obligation
For a contract to be enforceable, it must be supported by
consideration.

4. Labor & Employment Law-Employment at Will
Since an employer is not legally required to continue the employment
of an employee at-will, continued employment is consideration for
the contract not to compete.

5. Labor & Employment Law-Trade Secrets & Unfair Competition-
Noncompetition Agreements
A covenant restraining an employee from competing with his former
employer upon termination of employment is reasonable if it is no
greater than is required for the protection of the employer, does
not impose undue hardship on the employee, and is not injurious to
the public. Only those covenants which are reasonable will be
enforced.



APPENDIX B

Section from an exemplary
HIERARCHICAL LEGAL TOPIC SCHEME

Admiralty Law
Arbitration
Bankruptcy
Bills of Lading
Cargo Care & Custody
Due Diligence
General Average
Liability
Liability Exemptions
Limitation of Liability
Charter Parties
Contribution & Indemnity
Insurance
Jurisdiction
Law of Salvage
Liens & Mortgages
Negligence & Unseaworthiness
Non-Cargo Liability
Death Actions
Jones Act
Longshore & Harbor Workers' Compensation Act
Penalties & Forfeitures
Antitrust & Trade Regulation
Sherman Act
Clayton Act
Robinson-Patman Act
Federal Trade Commission Act
Market Definition
Restraints of Trade & Price Fixing
Exclusive or Reciprocal Dealing
Horizontal Market Allocation
Horizontal Refusals to Deal
Horizontal Restraints
Per Se Rule & Rule of Reason
Tying Arrangements
Vertical Price Restraints

[items purposely omitted]



APPENDIX C 



Selection from an exemplary 
STOP-WORD LIST



A 

ABLE 

ABOUT 

ABOVE 

ACCORDING 

ACROSS 

AFTER 

AGAIN 

AGAINST 

AGO 

ALL 

ALLOW 

ALLOWED 

ALLOWING 

ALLOWS 

ALMOST 

ALONE 

ALONG 

ALREADY 

ALSO 

ALTHOUGH

ALWAYS 

[Items purposely omitted] 

WHATEVER 

WHEN 

WHERE 

WHETHER 

WHICH 

WHILE 

WHO 

WHOLE 

WHOSE 

WHY 

WILL 

WITH

WITHIN

WITHOUT

WON 

WOULD 

WOULDN 

YESTERDAY 

YET 

YOU 

YOUR 

YOURS 

YOURSELF 



APPENDIX D

Selection from an exemplary
LEGAL PHRASE LIST

14th amendment
4th ed
5th ed
6th ed
8th ed

abatement act 
abettor statute 
abnormal sexual interest 
absent class member 
absent evidence 
absolute bar 
absolute discretion 
absolute divorce 
absolute duty 
absolute equality 
absolute immunity 
absolute priority rule 
absolute privilege 
absolute right 
absolute title 
abstention principle 
abstract book 

abuse-of-discretion standard 
ad damnum 

ad damnum clause 
ad valorem 
ad valorem tax 
additional evidence 
additional fact

additional peremptory challenge 
additional punishment 
additional suspect 
adequate consideration 
adequate notice 
adequate record 
adequate remedy 

[items purposely omitted] 
zoning appeal 
zoning authority
zoning case 
zoning enabling act 
zoning law 
zoning regulation 






We claim: 

1. A computer-implemented method of building a knowledge
base for a legal topic classification system, the method
comprising:

inputting a plurality of training documents; 

parsing the plurality of training documents to extract 
classified legal concepts; 

extracting features from the legal concepts; 

generating relevance scores for each feature; and 

storing features, topics, and relevance scores in a knowledge
base, using an inverted index.

2. The method as set forth in claim 1, the step of parsing 
comprising the steps of: 

partitioning the text by section; and 
partitioning the text by legal concept. 

3. The method as set forth in claim 1, the step of extracting 
features comprising the steps of: 

extracting terms, excluding stop words; 
extracting legal phrases; and 
extracting embedded case citations. 

4. The method as set forth in claim 1, the step of 
generating relevance scores including the steps of: 

converting features to terms;

generating, for each training concept, term frequency (TF)
for each term, as number of occurrences of that term in
that training concept;

generating, for each training concept, document frequency
(DF) for each term, as total number of training
concepts in which term appears;

generating inverse document frequency (IDF) for each
term; and

generating a relevance score for each term for each
concept.

5. A computer-implemented method of building a knowledge
base for a legal topic classification system, the method
comprising:

analyzing previously classified legal concepts to determine
distinguishing features for each concept;

generating relevance scores for each feature in each
training concept; and

storing features, topics, and relevance scores in a knowledge
base, using an inverted index.

6. The method as set forth in claim 5, the step of
generating relevance scores including the steps of:

converting features to terms;

generating, for each training concept, term frequency (TF)
for each term, as number of occurrences of that term in
that training concept;

generating, for each training concept, average term frequency
of terms;

generating, for each training concept, document frequency
(DF) for each term, as total number of training
concepts in which term appears;

determining DBSIZE as total number of training concepts
in knowledge base;

generating inverse document frequency (IDF) for each
term; and

generating a relevance score for each term for each
concept.

7. The method as set forth in claim 6, wherein the step of
generating IDF is performed using the formula, log
((DBSIZE-DF+0.5)/(DF+0.05)).

8. A computer-implemented method of processing an
input concept from a document text to provide, from a topic
scheme, a list of one or more topics that are relevant to the
input concept, the method comprising:

analyzing the input concept to arrive at a set of distinguishing
features;

converting candidate concept features to candidate terms;

searching a database of concepts, previously classified
according to the topic scheme, for concepts similar to
the input concept based on features;

ranking the similar concepts based on relevance score;
and

voting on topics associated with the concepts within the
database to form the list of topics relevant to the input
concept.

9. The method as set forth in claim 8, the step of ranking
including the steps of:

retrieving, for each training concept, relevance scores
from a knowledge base for all candidate terms;

calculating total relevance score for each training concept,
as a sum of candidate term relevance scores for that
concept; and

sorting training concepts by total relevance scores.

10. The method as set forth in claim 9, the step of ranking
further including, before the step of retrieving, the steps of:

sorting candidate terms by document frequency (DF) of
each term, as number of knowledge base training
concepts in which term occurs; and

reducing candidate term list to least common terms.

11. The method as set forth in claim 8, the step of voting
including the steps of:

retrieving topics associated with each training concept
from a knowledge base;

grouping training concepts and scores by associated topics;

calculating a total topic relevance score for each topic, as
a sum of training concept scores for each topic; and

sorting topics by total topic relevance score to create a
topic list.

12. The method as set forth in claim 11, further
comprising, within a hierarchical topic scheme, the steps of:

grouping topics by tier;

weighting the topic list according to number of occurrences
of each tier topic;

generating a final topic list using the weighted topic list;
and

sorting the final topic list by tier.

13. The method as set forth in claim 11, the step of sorting
including comparing each total topic relevance score to a
threshold and eliminating from the topic list those topics
having a total topic relevance score below the threshold.

14. The method as set forth in claim 11, the step of sorting
including the steps of:

determining a number of times each topic occurs;

comparing the number to a threshold; and

eliminating from the topic list those topics having a
number of occurrences below the threshold.

15. A computer-implemented method of processing an
input concept from a document text to provide, from a topic
scheme incorporating a plurality of training concepts, a list
of one or more topics that are relevant to the input concept,
the method comprising:

retrieving topics associated with the training concepts
from a knowledge base, the training concepts having
been previously classified and scored in accordance
with the topic scheme;

grouping training concepts and scores by associated topics;

calculating a total topic relevance score for each topic, as
a sum of training concept scores for each topic; and

sorting topics by total topic relevance score to create a
topic list relevant to the input concept.

16. The method as set forth in claim 15, further
comprising, within a hierarchical topic scheme, the steps of:

grouping topics by tier;

weighting the topic list according to number of occurrences
of each tier topic;

generating a final topic list using the weighted topic list;
and

sorting the final topic list by tier.

17. A computer-implemented method of processing an
input concept from a document text to identify, within a
knowledge base incorporating a plurality of training
concepts, concepts similar to the input concept and to rank
these similar concepts, the method comprising:

identifying features of the input concept as candidate
terms;

retrieving, from the knowledge base, relevance scores for
training concepts similar to the input concept;

calculating a total relevance score for each retrieved
training concept, as a sum of candidate term relevance
scores for that concept; and

sorting retrieved training concepts by total relevance
scores.

18. A computer-implemented method of building a
knowledge base for a legal topic classification system by
identifying features within previously classified training
concepts and generating relevance scores for these features,
the method comprising the steps of:

converting the features into terms;

generating, for each training concept, term frequency (TF)
for each term, as number of occurrences of that term in
that training concept;

generating, for each training concept, average term frequency
(AVE_TF) of terms;

generating, for each training concept, document frequency
(DF) for each term, as total number of training
concepts in which term appears;

determining training set DBSIZE as total number of
training concepts in the knowledge base;

generating inverse document frequency (IDF) for each
term; and

generating a relevance score for each term for each
concept.

19. The method as set forth in claim 18, wherein when a
length of a current concept, doclength, is greater than an
average length of concepts in a set, aveDocLength, the
relevance score is calculated using the formula TFwt×IDF,
where

$$\mathit{TFwt} = \frac{\mathit{TF} + \mathit{TF}/\mathit{AVE\_TF}}{\mathit{TF} + \mathit{TF}/\mathit{AVE\_TF} + 2\left(\alpha + \beta \times \dfrac{\mathit{doclength} - \mathit{aveDocLength}}{\mathit{aveDocLength}}\right)}$$

and IDF=log ((DBSIZE-DF+0.5)/(DF+0.05)).

20. The method as set forth in claim 18, wherein when a
length of a current concept, doclength, is less than or equal
to an average length of concepts in a set, aveDocLength, the
relevance score is calculated using the formula TFwt×IDF,
where

$$\mathit{TFwt} = \frac{\mathit{TF} + \mathit{TF}/\mathit{AVE\_TF}}{\mathit{TF} + \mathit{TF}/\mathit{AVE\_TF} + 2\left(\alpha + \beta \times \dfrac{\mathit{aveDocLength} - \mathit{doclength} + 1}{\mathit{aveDocLength}}\right)}$$

and IDF=log ((DBSIZE-DF+0.5)/(DF+0.05)).

* * * * *